
port opentelemetry operator #358

Merged
merged 2 commits into from
Apr 17, 2024

Conversation

@THUzxj (Contributor) commented Mar 13, 2024

Porting the OpenTelemetry operator to Acto.

@THUzxj (Contributor, Author) commented Mar 17, 2024

Alarm 1

Description

The state transition is as follows:

# mutated 0
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: test-cluster
spec:
  config: "receivers:\n  otlp:\n    protocols:\n      grpc:\n      http:\nprocessors:\n\
    \  memory_limiter:\n    check_interval: 1s\n    limit_percentage: 75\n    spike_limit_percentage:\
    \ 15\n  batch:\n    send_batch_size: 10000\n    timeout: 10s\n\nexporters:\n \
    \ debug:\n\nservice:\n  pipelines:\n    traces:\n      receivers: [otlp]\n   \
    \   processors: []\n      exporters: [debug]"
---
# mutated 1
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: test-cluster
spec:
  config: "receivers:\n  otlp:\n    protocols:\n      grpc:\n      http:\nprocessors:\n\
    \  memory_limiter:\n    check_interval: 1s\n    limit_percentage: 75\n    spike_limit_percentage:\
    \ 15\n  batch:\n    send_batch_size: 10000\n    timeout: 10s\n\nexporters:\n \
    \ debug:\n\nservice:\n  pipelines:\n    traces:\n      receivers: [otlp]\n   \
    \   processors: []\n      exporters: [debug]"
  podAnnotations:
    actokey: actokey
---
# mutated 2
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: test-cluster
spec:
  config: "receivers:\n  otlp:\n    protocols:\n      grpc:\n      http:\nprocessors:\n\
    \  memory_limiter:\n    check_interval: 1s\n    limit_percentage: 75\n    spike_limit_percentage:\
    \ 15\n  batch:\n    send_batch_size: 10000\n    timeout: 10s\n\nexporters:\n \
    \ debug:\n\nservice:\n  pipelines:\n    traces:\n      receivers: [otlp]\n   \
    \   processors: []\n      exporters: [debug]"
  podAnnotations: {}
---
# mutated 3
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: test-cluster
spec:
  config: "receivers:\n  otlp:\n    protocols:\n      grpc:\n      http:\nprocessors:\n\
    \  memory_limiter:\n    check_interval: 1s\n    limit_percentage: 75\n    spike_limit_percentage:\
    \ 15\n  batch:\n    send_batch_size: 10000\n    timeout: 10s\n\nexporters:\n \
    \ debug:\n\nservice:\n  pipelines:\n    traces:\n      receivers: [otlp]\n   \
    \   processors: []\n      exporters: [debug]"
  podAnnotations:
    actokey2: actokey2

In states mutated 2 and mutated 3, the pod annotation `actokey: actokey` is not deleted. When describing the pod and the deployment `test-cluster-collector` in mutated 3, users can see both `actokey: actokey` and `actokey2: actokey2` in their annotations.

Explanation

This is a true alarm. When logging the annotation-related variables in the operator (such as `desiredObjects` at [this line](https://github.com/open-telemetry/opentelemetry-operator/blob/2ddfe3193a2dad57836ebbca5fb92481c35dbeff/controllers/common.go#L81)), we can see that the data changes as expected. The problem therefore likely lies in the use of the Kubernetes controller-runtime API at [this line](https://github.com/open-telemetry/opentelemetry-operator/blob/2ddfe3193a2dad57836ebbca5fb92481c35dbeff/controllers/common.go#L101).

Fix

Check and fix the usage of the Kubernetes controller-runtime API so that the system state matches the variables computed in the operator.
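The stale-annotation behavior described above can be modeled with a minimal sketch (plain Python, not the operator's actual code; the function names are hypothetical). If the reconciler computes the desired annotations correctly but the update path merges the desired map into the live object instead of replacing it, a key removed from the CR is never deleted downstream:

```python
def merge_update(live: dict, desired: dict) -> dict:
    """Model of a merge-style update: adds/overwrites keys, never deletes."""
    merged = dict(live)
    merged.update(desired)
    return merged

def replace_update(live: dict, desired: dict) -> dict:
    """Model of a replace-style update: the desired map wins entirely."""
    return dict(desired)

live = {"actokey": "actokey"}        # annotations after "mutated 1"
desired = {"actokey2": "actokey2"}   # spec.podAnnotations in "mutated 3"

print(merge_update(live, desired))   # {'actokey': 'actokey', 'actokey2': 'actokey2'}
print(replace_update(live, desired)) # {'actokey2': 'actokey2'}
```

Under merge semantics, both keys survive on the Deployment, which is exactly the symptom observed in mutated 3; replace semantics would drop the removed key.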

Alarm 2

Description

The state transition is as follows:

# mutated 0
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: test-cluster
spec:
  config: "receivers:\n  otlp:\n    protocols:\n      grpc:\n      http:\nprocessors:\n\
    \  memory_limiter:\n    check_interval: 1s\n    limit_percentage: 75\n    spike_limit_percentage:\
    \ 15\n  batch:\n    send_batch_size: 10000\n    timeout: 10s\n\nexporters:\n \
    \ debug:\n\nservice:\n  pipelines:\n    traces:\n      receivers: [otlp]\n   \
    \   processors: []\n      exporters: [debug]"
---
# mutated 1
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: test-cluster
spec:
  config: "receivers:\n  otlp:\n    protocols:\n      grpc:\n      http:\nprocessors:\n\
    \  memory_limiter:\n    check_interval: 1s\n    limit_percentage: 75\n    spike_limit_percentage:\
    \ 15\n  batch:\n    send_batch_size: 10000\n    timeout: 10s\n\nexporters:\n \
    \ debug:\n\nservice:\n  pipelines:\n    traces:\n      receivers: [otlp]\n   \
    \   processors: []\n      exporters: [debug]"
  image: actokey

Explanation

This is a false alarm. The fake image `actokey` cannot be pulled, so the system is expected to end up in an error state.

Fix

Make Acto generate values closer to real ones (for example, valid image references), which would improve test coverage.

Alarm 3

Description

Acto throws many alarms about the health status, but these alarms cannot be reproduced reliably.

Explanation

These are false alarms. The pulled image, `ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector:0.95.0`, is not included in the preload images in `context.json`. As a result, the image pull fails in some trials, and `test-cluster` is not yet ready when the system state is collected. In many trials, `test-cluster` is not ready when the system state of the first several mutations is checked.

Fix

To fix this problem, make Acto collect the correct initial state. Acto could also raise an error when it detects that the cluster under test is still pulling images, so that users can fix the context manually.
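As a sketch of the manual workaround applied in the follow-up commit (the exact `context.json` layout is an assumption here; only the `preload_images` field name is taken from the commit message), the missed image can be added to the preload list:

```json
{
  "preload_images": [
    "ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector:0.95.0"
  ]
}
```

With the image preloaded, trials no longer depend on pulling it at test time, so the cluster is ready before the system state is collected.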

@tylergu (Member) commented Mar 26, 2024

Hi @THUzxj, could you write a bug report for the first alarm using this bug template? https://github.com/open-telemetry/opentelemetry-operator/issues/new?assignees=&labels=bug%2Cneeds+triage&projects=&template=bug_report.yaml

You can then file this bug with the upstream developers and get some credit :)

@tianyin (Member) commented Mar 26, 2024

The bug is cool. Would love to see the bug report.

@THUzxj (Contributor, Author) commented Mar 27, 2024

The bug report has been submitted as described above. The new commit manually adds the missed preload images and marks some fields as `required` for deeper testing.

@tylergu merged commit 15ee26c into xlab-uiuc:main on Apr 17, 2024
4 checks passed
RenZhou0327 pushed a commit to RenZhou0327/acto that referenced this pull request Apr 19, 2024
* port opentelemetry operator

* Add the missed preload_images and required field